Smart Watch Data Analysis¶

Purpose & Objectives¶

Purpose¶

The purpose of this project is to analyze smartwatch data to predict activity types and calories burned using various machine learning models. The specific objectives are:

  • To understand the relationship between different features (steps, distance, heart rate, etc.) and the target variables (activity type and calories burned).

  • To develop and evaluate machine learning models that accurately predict activity types and calories burned.

  • To provide actionable insights and recommendations based on the analysis.

Scope¶

The scope of the analysis includes:

  • Exploratory Data Analysis (EDA) to understand the data distribution, detect anomalies, and explore relationships between features.

  • Preprocessing steps such as handling missing values, encoding categorical variables, and feature scaling.

  • Development of machine learning models, including a Random Forest Classifier for activity type prediction and a Stacking Regressor for calorie prediction.

  • Evaluation of model performance using appropriate metrics.

  • Generation of insights and recommendations based on the analysis results.

Objectives¶

  1. Data Collection and Integration:

Ensure seamless collection and integration of data from various smartwatch models and other health platforms.

  2. Data Processing and Storage:

Clean, preprocess, and store the data in a robust database.

  3. Data Interpretation and Visualization:

Develop intuitive visualizations and clear interpretations of the data to help users easily understand their health metrics.

  4. Personalized Recommendations:

Create algorithms that offer personalized health and fitness recommendations based on the user's unique data, goals, and preferences.

  5. Real-time Feedback and Alerts:

Implement real-time data analysis to provide users with immediate feedback and alerts for activities, heart rate, sleep, etc.

  6. Trend Analysis and Insights:

Identify long-term trends and patterns in users' data to provide insights into their health and fitness progress over time.

  7. User Engagement and Motivation:

Develop features to keep users engaged with their health data, such as gamification, social sharing, and regular updates.

  8. Accuracy and Reliability:

Ensure the accuracy and reliability of the collected data and the insights generated, to build user trust and confidence.

  9. Privacy and Security:

Implement robust privacy and security measures to protect users' data and comply with relevant regulations.

Problem the Project Aims to Solve¶

The core problem this "Smartwatch Data Analysis" project aims to solve is to make sense of the vast amount of data collected by smartwatches and turn it into actionable insights for users. To be more specific:

  • Data Overload:

Problem: Users receive an overwhelming amount of data from their smartwatches, including steps, heart rate, sleep patterns, and more.

Solution: Synthesizing and presenting this data in a clear, understandable, and actionable way to help users make informed decisions about their health.

  • Personalization:

Problem: Not all users have the same fitness goals or health needs.

Solution: Developing a system that tailors insights and recommendations to individual users based on their unique data and preferences.

  • Integration:

Problem: Difficulty in integrating smartwatch data with other health and fitness platforms and apps.

Solution: Ensuring compatibility and seamless data sharing across different ecosystems to provide a holistic view of users' health.

  • Real-time Analysis:

Problem: Users want immediate feedback on their activities but often face delays.

Solution: Implementing real-time data analysis and instant alerts or recommendations to improve user engagement and timely adjustments.

  • Data Accuracy:

Problem: Inaccurate data can lead to incorrect insights and recommendations.

Solution: Ensuring the accuracy and reliability of the collected data to build user trust and confidence.

  • User Engagement:

Problem: Keeping users engaged with their health data over the long term can be challenging.

Solution: Incorporating gamification, social features, and regular updates to maintain interest and motivation.

By addressing these problems, this project aims to enhance the user experience, promote healthier lifestyles, and provide valuable insights that users can trust and act upon.

Importance and Relevance of the Project:¶

  • Enhanced Health Awareness:

Empowering Users: By translating raw data into clear insights, users can gain a deeper understanding of their health metrics and make informed decisions to improve their well-being.

Preventive Health: Real-time data and alerts can help users identify potential health issues early, allowing for timely intervention and preventive measures.

  • Promoting Healthy Lifestyles:

Motivation: Personalized recommendations and trend analysis can motivate users to stay active, eat healthier, and adopt better sleep habits.

Engagement: Gamification and social features can make fitness and health tracking more engaging and enjoyable, leading to sustained healthy behaviors.

  • Data-Driven Insights:

Accuracy: Leveraging accurate and reliable data ensures that users can trust the insights and recommendations provided by their smartwatches.

Personalization: Tailored insights cater to individual needs, goals, and preferences, making the data more relevant and actionable.

  • Integration and Compatibility:

Seamless Experience: Integrating smartwatch data with other health and fitness platforms creates a cohesive and comprehensive view of users' health.

Holistic Health: Combining data from various sources provides a holistic understanding of health and fitness, enabling users to see the bigger picture.

  • Technological Advancement:

Innovation: Developing advanced algorithms and visualizations pushes the boundaries of what smartwatches can do, contributing to technological progress.

User-Centric Design: Focusing on user needs and preferences drives innovation in the design and functionality of health-tracking devices.

  • Public Health Impact:

Population Health: Aggregated and anonymized data from smartwatches can provide valuable insights into population health trends, aiding public health initiatives.

Research and Development: The data can be used for research purposes, contributing to the development of new health interventions and technologies.

Summary¶

The approach to achieving the objectives of the smartwatch data analysis project involves:

  1. Data Collection and Integration: Seamlessly gathering and integrating data from various smartwatches and health platforms using APIs.

  2. Data Processing and Storage: Cleaning, preprocessing, and storing the data in a robust database.

  3. Data Interpretation and Visualization: Developing algorithms for data interpretation and creating an intuitive user interface for clear visualizations.

  4. Personalized Recommendations: Building user profiles and a recommendation engine to offer tailored health and fitness suggestions.

  5. Real-time Feedback and Alerts: Implementing real-time data processing for immediate feedback and alerts through a notification system.

  6. Trend Analysis and Long-term Insights: Analyzing historical data for trend identification and generating detailed progress reports.

  7. User Engagement and Motivation: Incorporating gamification elements and social features to keep users motivated and engaged.

  8. Accuracy and Reliability: Ensuring data accuracy and reliability through continuous validation and user feedback.

  9. Privacy and Security: Implementing robust encryption and compliance with privacy regulations to protect user data.

This approach ensures the project effectively provides valuable health and fitness insights to users, enhancing their overall well-being.

Part 1: Data Cleaning and Preprocessing¶

Import Libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.datasets import fetch_california_housing
import lightgbm as lgb
from sklearn.compose import ColumnTransformer
import warnings
import joblib

Code Explanation: These libraries are essential for handling data (Pandas, NumPy), creating visualizations (Matplotlib, Seaborn), and building and evaluating machine learning models (Scikit-learn).

Why it's important: Libraries consist of pre-written code that you can use in your projects, so you don't have to write everything from scratch, saving time and effort. They also provide a vast range of functionality that would be difficult or time-consuming to implement on your own, from complex mathematical functions to data visualization tools.

Load Dataset¶

In [139]:
data = pd.read_csv('smartwatch.csv')
data.head()
Out[139]:
Unnamed: 0 X1 age gender height weight steps hear_rate calories distance entropy_heart entropy_setps resting_heart corr_heart_steps norm_heart intensity_karvonen sd_norm_heart steps_times_distance device activity
0 1 1 20 1 168.0 65.4 10.771429 78.531302 0.344533 0.008327 6.221612 6.116349 59.0 1.000000 19.531302 0.138520 1.000000 0.089692 apple watch Lying
1 2 2 20 1 168.0 65.4 11.475325 78.453390 3.287625 0.008896 6.221612 6.116349 59.0 1.000000 19.453390 0.137967 1.000000 0.102088 apple watch Lying
2 3 3 20 1 168.0 65.4 12.179221 78.540825 9.484000 0.009466 6.221612 6.116349 59.0 1.000000 19.540825 0.138587 1.000000 0.115287 apple watch Lying
3 4 4 20 1 168.0 65.4 12.883117 78.628260 10.154556 0.010035 6.221612 6.116349 59.0 1.000000 19.628260 0.139208 1.000000 0.129286 apple watch Lying
4 5 5 20 1 168.0 65.4 13.587013 78.715695 10.825111 0.010605 6.221612 6.116349 59.0 0.982816 19.715695 0.139828 0.241567 0.144088 apple watch Lying

Code Explanation: The dataset is loaded into a DataFrame, and the first few rows are displayed to understand the structure and content of the data.

Why it's important: Understanding the structure of the data is the first step in any data analysis or machine learning project. It allows you to familiarize yourself with the types of variables, their formats, and the overall organization of the data.
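The same first-look inspection can be sketched on a toy DataFrame; the column names and values below are illustrative, not the project's actual data.

```python
import pandas as pd

# Toy frame mimicking a few of the smartwatch columns (illustrative values only)
sample = pd.DataFrame({
    "steps": [10.77, 11.48, 12.18],
    "hear_rate": [78.53, 78.45, 78.54],
    "device": ["apple watch"] * 3,
    "activity": ["Lying"] * 3,
})

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # per-column data types
print(sample.head(2))  # first rows, as with data.head()
```

Checking `.shape` and `.dtypes` alongside `.head()` quickly reveals unexpected types (e.g., numbers read in as strings) before any modeling begins.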

Task 1: Inspect the Dataset: Check for missing values, duplicates, and understand the distribution of each column.¶

In [140]:
print("\nMissing values in each column:")
print(data.isnull().sum())
Missing values in each column:
Unnamed: 0              0
X1                      0
age                     0
gender                  0
height                  0
weight                  0
steps                   0
hear_rate               0
calories                0
distance                0
entropy_heart           0
entropy_setps           0
resting_heart           0
corr_heart_steps        0
norm_heart              0
intensity_karvonen      0
sd_norm_heart           0
steps_times_distance    0
device                  0
activity                0
dtype: int64

Code Explanation: The code prints a message indicating the next output will show the count of missing values, then calculates and displays the number of missing values in each column of the DataFrame. Identifying missing values helps us understand which columns need imputation or removal.

Why it's important: Identifying missing data is crucial for data cleaning and preprocessing, ensuring the quality and completeness of your dataset. Properly handling missing data leads to more accurate analyses and reliable models.
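A minimal sketch of this check on a toy frame with a deliberate gap; alongside the raw counts, `isnull().mean()` gives the share of missing rows per column, which is often easier to act on.

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value in "steps" (illustrative data)
df = pd.DataFrame({"steps": [10.0, np.nan, 12.0], "calories": [0.3, 3.2, 9.5]})

missing_counts = df.isnull().sum()      # absolute count of gaps per column
missing_pct = df.isnull().mean() * 100  # percentage of rows missing per column

print(missing_counts)
print(missing_pct)
```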

In [141]:
print("\nNumber of duplicate rows:")
print(data.duplicated().sum())

data = data.drop_duplicates()
Number of duplicate rows:
0

Code Explanation: Prints a message indicating that the next output will show the count of duplicate rows. Calculates and displays the number of duplicate rows in the DataFrame. Removes duplicate rows from the DataFrame and updates the DataFrame with the result.

Why it's important: Identifying and removing duplicate rows is essential for maintaining the integrity and accuracy of the dataset. Duplicate data can lead to biased analysis, skewed results, and inaccurate models. Ensuring the dataset is free from duplicates improves the reliability and validity of the insights derived from the data.
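The deduplication step can be demonstrated on a toy frame containing one exact repeat; `duplicated()` flags repeats of earlier rows, and `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Toy frame where row 1 exactly repeats row 0 (illustrative data)
df = pd.DataFrame({"steps": [10, 10, 12], "device": ["apple watch"] * 3})

n_dupes = df.duplicated().sum()  # number of rows that repeat an earlier row
deduped = df.drop_duplicates()   # first occurrence of each row is kept

print(n_dupes)       # 1
print(len(deduped))  # 2
```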

In [142]:
print("\nBasic statistics for each column:")
print(data.describe())
Basic statistics for each column:
        Unnamed: 0           X1          age       gender       height  \
count  6264.000000  6264.000000  6264.000000  6264.000000  6264.000000   
mean   3132.500000  1771.144317    29.158525     0.476533   169.709052   
std    1808.405375  1097.988748     8.908978     0.499489    10.324698   
min       1.000000     1.000000    18.000000     0.000000   143.000000   
25%    1566.750000   789.750000    23.000000     0.000000   160.000000   
50%    3132.500000  1720.000000    28.000000     0.000000   168.000000   
75%    4698.250000  2759.250000    33.000000     1.000000   180.000000   
max    6264.000000  3670.000000    56.000000     1.000000   191.000000   

            weight        steps    hear_rate     calories     distance  \
count  6264.000000  6264.000000  6264.000000  6264.000000  6264.000000   
mean     69.614464   109.562268    86.142331    19.471823    13.832555   
std      13.451878   222.797908    28.648385    27.309765    45.941437   
min      43.000000     1.000000     2.222222     0.056269     0.000440   
25%      60.000000     5.159534    75.598079     0.735875     0.019135   
50%      68.000000    10.092029    77.267680     4.000000     0.181719   
75%      77.300000   105.847222    95.669118    20.500000    15.697188   
max     115.000000  1714.000000   194.333333    97.500000   335.000000   

       entropy_heart  entropy_setps  resting_heart  corr_heart_steps  \
count    6264.000000    6264.000000    6264.000000       6264.000000   
mean        6.030314       5.739984      65.869938          0.306447   
std         0.765574       1.256348      21.203017          0.775418   
min         0.000000       0.000000       3.000000         -1.000000   
25%         6.108524       5.909440      58.134333         -0.467303   
50%         6.189825       6.157197      75.000000          0.665829   
75%         6.247928       6.247928      76.138701          1.000000   
max         6.475733       6.475733     155.000000          1.000000   

        norm_heart  intensity_karvonen  sd_norm_heart  steps_times_distance  
count  6264.000000         6264.000000    6264.000000           6264.000000  
mean     20.272393            0.155479       8.110854            590.035239  
std      28.388116            0.210927      12.535080           4063.838530  
min     -76.000000           -2.714286       0.000000              0.000690  
25%       1.148883            0.009819       0.264722              0.659260  
50%       9.820254            0.079529       2.893503             13.368619  
75%      27.077336            0.211868       9.679672             93.728562  
max     156.319444            1.297980      74.457929          51520.000000  

Code Explanation: Prints a message indicating that the next output will show basic statistics for each column. Calculates and displays basic statistics for each column in the DataFrame, including count, mean, standard deviation, minimum and maximum values, and the 25th, 50th (median), and 75th percentiles. These descriptive statistics summarize the distribution of the data.

Why it's important: Provides a quick summary of the dataset, helping to understand the distribution and central tendencies of the data. Assists in identifying any potential issues or outliers in the dataset. Helps in making informed decisions about data preprocessing and feature engineering.
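One common way to turn these quartiles into an outlier check is the 1.5×IQR rule. The sketch below applies it to a toy series loosely shaped like the skewed `steps` column (illustrative values, not the project's data).

```python
import pandas as pd

# Toy "steps"-like series with one extreme value (illustrative data)
s = pd.Series([5.2, 10.1, 105.8, 9.8, 11.0, 1714.0], name="steps")

# Fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR flag candidate outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

Values outside the fences are candidates for inspection, not automatic removal; in activity data, extreme step counts can be legitimate.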

In [143]:
print("\nDistribution of each column:")
data.hist(bins=20, figsize=(20, 15))
plt.show()
Distribution of each column:
[Figure: histograms of each numeric column]

Code Explanation: Prints a message indicating that the next output will show the distribution of each column. Plots histograms for each column in the DataFrame with a specified number of bins and figure size, then displays them. Histograms help visualize the distribution and identify any skewness or outliers in the data.

Why it's important: Visualizing the distribution of each column helps to understand the data's overall spread and shape. It identifies patterns, trends, and potential outliers in the data, and aids in determining the appropriate statistical methods and transformations for further analysis.

In [144]:
print("\nPairplot to visualize relationships and distributions:")
sns.pairplot(data)
plt.show()
Pairplot to visualize relationships and distributions:
[Figure: pairplot of pairwise relationships and per-column distributions]

Code Explanation: Prints a message indicating that the next output will show a pairplot. Generates a pairplot for the dataset, which includes scatter plots for pairwise relationships and histograms for individual distributions, then displays it. A pairplot helps visualize the pairwise relationships and distributions between columns.

Why it's important: Pairplots help visualize the relationships between different variables in the dataset. They provide insights into the correlations and interactions between variables. The plots also help identify any patterns, trends, or potential outliers that could affect the analysis and model building.

Task 2: Handle Missing Values: Decide how to address missing values based on data context (e.g.,drop rows, use median/mean for filling).¶

In [145]:
print("\nMissing values in each column:")
print(data.isnull().sum())
Missing values in each column:
Unnamed: 0              0
X1                      0
age                     0
gender                  0
height                  0
weight                  0
steps                   0
hear_rate               0
calories                0
distance                0
entropy_heart           0
entropy_setps           0
resting_heart           0
corr_heart_steps        0
norm_heart              0
intensity_karvonen      0
sd_norm_heart           0
steps_times_distance    0
device                  0
activity                0
dtype: int64

Code Explanation: The code prints a message indicating the next output will show the count of missing values, then calculates and displays the number of missing values in each column of the DataFrame. Identifying missing values helps us understand which columns need imputation or removal.

Why it's important: Identifying missing data is crucial for data cleaning and preprocessing, ensuring the quality and completeness of your dataset. Properly handling missing data leads to more accurate analyses and reliable models.

In [146]:
data_dropped = data.dropna()
print("\nDataset after dropping rows with missing values:")
print(data_dropped.head())
Dataset after dropping rows with missing values:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Code Explanation: Drops rows with missing values from the DataFrame and assigns the result to a new DataFrame named data_dropped. Prints a message indicating that the next output will show the dataset after dropping rows with missing values, then displays the first few rows of data_dropped. Dropping rows with missing values is a simple strategy, but it may lead to loss of important data if many rows have missing values.

Why it's important: Removing rows with missing values is a common data cleaning step to ensure the integrity and quality of the dataset. This step helps to avoid potential biases or errors that may arise from incomplete data during analysis. It ensures that subsequent analyses and models are built on a more accurate and reliable dataset.
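The trade-off described above is easy to see on a toy frame: `dropna()` discards every row that has at least one gap, so scattered gaps can remove a large share of the data.

```python
import pandas as pd
import numpy as np

# Two of four rows each have one gap (illustrative data)
df = pd.DataFrame({
    "steps": [10.0, np.nan, 12.0, 13.0],
    "calories": [0.3, 3.2, np.nan, 10.1],
})

complete = df.dropna()  # only fully complete rows survive
print(len(complete))    # 2 of the 4 rows remain
```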

In [147]:
data_filled_median = data.copy()
numeric_columns = data.select_dtypes(include=['number']).columns
data_filled_median[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].median())

print("Dataset after filling missing values with median for numeric columns:")
print(data_filled_median.head())

print("\nMissing values in each column after handling missing values:")
print(data_filled_median.isnull().sum())
Dataset after filling missing values with median for numeric columns:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Missing values in each column after handling missing values:
Unnamed: 0              0
X1                      0
age                     0
gender                  0
height                  0
weight                  0
steps                   0
hear_rate               0
calories                0
distance                0
entropy_heart           0
entropy_setps           0
resting_heart           0
corr_heart_steps        0
norm_heart              0
intensity_karvonen      0
sd_norm_heart           0
steps_times_distance    0
device                  0
activity                0
dtype: int64

Code Explanation: Creates a copy of the original DataFrame and assigns it to data_filled_median. Identifies the numeric columns in the DataFrame, fills missing values in those columns with each column's median, and updates data_filled_median. It then displays the first few rows of the updated DataFrame and the count of missing values in each column to confirm the fill. Filling missing values with the median is useful because it is less affected by outliers than the mean.

Why it's important: Filling missing values with the median helps to maintain the distribution and central tendency of numeric data without being affected by outliers. It ensures that the dataset remains complete and reliable for analysis and modeling. This method is particularly useful for datasets with skewed distributions, as the median is less affected by extreme values than the mean.
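The median's robustness can be shown on a toy series where a single extreme value inflates the mean but leaves the median representative (illustrative values only).

```python
import pandas as pd
import numpy as np

# One extreme value (97.5) pulls the mean up; the median stays central
s = pd.Series([4.0, 5.0, np.nan, 6.0, 97.5])

filled_median = s.fillna(s.median())  # gap becomes 5.5
filled_mean = s.fillna(s.mean())      # gap becomes 28.125, inflated by the outlier

print(filled_median[2], filled_mean[2])
```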

In [148]:
data_filled_mean = data.copy()
numeric_columns = data.select_dtypes(include=['number']).columns
data_filled_mean[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())

print("Dataset after filling missing values with mean for numeric columns:")
print(data_filled_mean.head())

print("\nMissing values in each column after handling missing values:")
print(data_filled_mean.isnull().sum())
Dataset after filling missing values with mean for numeric columns:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Missing values in each column after handling missing values:
Unnamed: 0              0
X1                      0
age                     0
gender                  0
height                  0
weight                  0
steps                   0
hear_rate               0
calories                0
distance                0
entropy_heart           0
entropy_setps           0
resting_heart           0
corr_heart_steps        0
norm_heart              0
intensity_karvonen      0
sd_norm_heart           0
steps_times_distance    0
device                  0
activity                0
dtype: int64

Code Explanation: Creates a copy of the original DataFrame and assigns it to data_filled_mean. Identifies the numeric columns in the DataFrame, fills missing values in those columns with each column's mean, and updates data_filled_mean. It then displays the first few rows of the updated DataFrame and the count of missing values in each column to confirm the fill. Filling missing values with the mean is a common strategy, but it can be influenced by outliers in the data.

Why it's important: Filling missing values with the mean helps to maintain the overall central tendency of numeric data, though it can be affected by outliers. It ensures that the dataset remains complete and ready for analysis and modeling. This method is useful for datasets where the mean is a good representation of the central value.

In [149]:
specific_value = 0
data_filled_value = data.fillna(specific_value)
print("\nDataset after filling missing values with a specific value:")
print(data_filled_value.head())
Dataset after filling missing values with a specific value:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Code Explanation: Creates a DataFrame named data_filled_value by filling all missing values in the original DataFrame with a specific value (in this case, 0). Prints a message indicating that the next output will show the dataset after filling missing values with a specific value, then displays the first few rows of data_filled_value. Filling missing values with a specific value (e.g., 0) can be useful when you want to indicate the absence of data explicitly.

Why it's important: Filling missing values with a specific value ensures there are no gaps in the dataset, making it complete and ready for analysis. This approach is useful when a default value (like 0) makes sense in the context of the data, ensuring consistency. It helps avoid potential issues in subsequent analyses and modeling caused by missing data, leading to more accurate and reliable results.

In [150]:
data.columns = data.columns.str.strip()

Code Explanation: Removes any leading or trailing whitespace from the column names of the DataFrame. Cleaning column names ensures consistency and avoids issues during data manipulation.

Why it's important: Ensures consistency in column names, which is crucial for accurate data manipulation and analysis. Prevents potential issues caused by extra spaces when accessing columns by name or merging datasets. Improves the overall cleanliness and readability of the dataset, making it easier to work with.
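A minimal sketch of what `str.strip()` does to a padded column name (the padded name here is hypothetical):

```python
import pandas as pd

# ' steps ' stands in for a column name with stray whitespace
df = pd.DataFrame({" steps ": [1], "calories": [2]})
df.columns = df.columns.str.strip()
print(list(df.columns))
```

After stripping, `df[" steps "]` would raise a KeyError while `df["steps"]` works, which is exactly the inconsistency this step prevents.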

In [151]:
print("\nMissing values in each column after handling missing values:")
print(data.isnull().sum())
Missing values in each column after handling missing values:
Unnamed: 0              0
X1                      0
age                     0
gender                  0
height                  0
weight                  0
steps                   0
hear_rate               0
calories                0
distance                0
entropy_heart           0
entropy_setps           0
resting_heart           0
corr_heart_steps        0
norm_heart              0
intensity_karvonen      0
sd_norm_heart           0
steps_times_distance    0
device                  0
activity                0
dtype: int64

Code Explanation: Prints a message indicating that the next output will show the count of missing values in each column after handling missing values. Displays the count of missing values in each column of the DataFrame. This step helps us verify that the missing values have been effectively handled.

Why it's important: Provides a final check to confirm that all missing values have been appropriately handled. Ensures the dataset is now clean and ready for accurate analysis and modeling without the issues caused by missing data.

Task 3: Data Transformation: Convert categorical variables (like gender) to numeric if needed and normalize numerical data for consistent analysis.¶

In [152]:
print("First few rows of the dataset:")
print(data.head())
First few rows of the dataset:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Code Explanation: Prints a message indicating that the next output will show the first few rows of the dataset. Displays the first few rows of the DataFrame. Inspecting the first few rows helps us understand the data structure and identify any immediate issues.

Why it's important: Provides a quick overview of the dataset, helping to understand its structure and content. Assists in verifying that the data has been loaded correctly and in its expected format. Helps identify any initial issues or anomalies in the dataset before further analysis.

In [153]:
if 'gender' in data.columns:
    label_encoder = LabelEncoder()
    data['gender'] = label_encoder.fit_transform(data['gender'])

print("\nDataset after converting categorical variables to numeric:")
print(data.head())
Dataset after converting categorical variables to numeric:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Code Explanation: Checks if the 'gender' column exists in the DataFrame. If it does, applies label encoding to convert the categorical 'gender' variable into numeric values using LabelEncoder. Prints a message indicating that the next output will show the dataset after converting categorical variables to numeric. Displays the first few rows of the updated DataFrame. Converting categorical variables to numeric values is essential for machine learning algorithms that require numerical input; LabelEncoder transforms categorical labels into integer values. Note that in this dataset 'gender' already holds 0/1 values (the head is unchanged before and after), so the encoding is effectively a no-op here; the step mainly guards against string-labeled inputs.

Why it's important: Converting categorical variables to numeric is essential for many machine learning algorithms that require numerical input. It allows for the inclusion of categorical data in predictive models, expanding the range of features available for analysis. Ensures the dataset is ready for further analysis and modeling, enhancing the overall robustness and accuracy of the models.
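For string-labeled inputs, LabelEncoder's mapping can be inspected via `classes_` and reversed with `inverse_transform`; a small sketch with hypothetical labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["male", "female", "male"]})  # hypothetical labels
le = LabelEncoder()
df["gender"] = le.fit_transform(df["gender"])

# classes_ lists the original label for each integer code (sorted order)
print(list(le.classes_))
print(df["gender"].tolist())
```

`le.inverse_transform(df["gender"])` recovers the original labels later, which is handy when producing human-readable reports.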

In [154]:
numeric_columns = data.select_dtypes(include=[np.number]).columns

print("Descriptive statistics before scaling:")
before_scaling = data[numeric_columns].describe()
print(before_scaling)

scaler = StandardScaler()

data_scaled = data.copy()
data_scaled[numeric_columns] = scaler.fit_transform(data[numeric_columns])

print("\nDescriptive statistics after scaling:")
after_scaling = data_scaled[numeric_columns].describe()
print(after_scaling)
Descriptive statistics before scaling:
        Unnamed: 0           X1          age       gender       height  \
count  6264.000000  6264.000000  6264.000000  6264.000000  6264.000000   
mean   3132.500000  1771.144317    29.158525     0.476533   169.709052   
std    1808.405375  1097.988748     8.908978     0.499489    10.324698   
min       1.000000     1.000000    18.000000     0.000000   143.000000   
25%    1566.750000   789.750000    23.000000     0.000000   160.000000   
50%    3132.500000  1720.000000    28.000000     0.000000   168.000000   
75%    4698.250000  2759.250000    33.000000     1.000000   180.000000   
max    6264.000000  3670.000000    56.000000     1.000000   191.000000   

            weight        steps    hear_rate     calories     distance  \
count  6264.000000  6264.000000  6264.000000  6264.000000  6264.000000   
mean     69.614464   109.562268    86.142331    19.471823    13.832555   
std      13.451878   222.797908    28.648385    27.309765    45.941437   
min      43.000000     1.000000     2.222222     0.056269     0.000440   
25%      60.000000     5.159534    75.598079     0.735875     0.019135   
50%      68.000000    10.092029    77.267680     4.000000     0.181719   
75%      77.300000   105.847222    95.669118    20.500000    15.697188   
max     115.000000  1714.000000   194.333333    97.500000   335.000000   

       entropy_heart  entropy_setps  resting_heart  corr_heart_steps  \
count    6264.000000    6264.000000    6264.000000       6264.000000   
mean        6.030314       5.739984      65.869938          0.306447   
std         0.765574       1.256348      21.203017          0.775418   
min         0.000000       0.000000       3.000000         -1.000000   
25%         6.108524       5.909440      58.134333         -0.467303   
50%         6.189825       6.157197      75.000000          0.665829   
75%         6.247928       6.247928      76.138701          1.000000   
max         6.475733       6.475733     155.000000          1.000000   

        norm_heart  intensity_karvonen  sd_norm_heart  steps_times_distance  
count  6264.000000         6264.000000    6264.000000           6264.000000  
mean     20.272393            0.155479       8.110854            590.035239  
std      28.388116            0.210927      12.535080           4063.838530  
min     -76.000000           -2.714286       0.000000              0.000690  
25%       1.148883            0.009819       0.264722              0.659260  
50%       9.820254            0.079529       2.893503             13.368619  
75%      27.077336            0.211868       9.679672             93.728562  
max     156.319444            1.297980      74.457929          51520.000000  

Descriptive statistics after scaling:
        Unnamed: 0            X1           age        gender        height  \
count  6264.000000  6.264000e+03  6.264000e+03  6.264000e+03  6.264000e+03   
mean      0.000000  9.074620e-17  2.268655e-17  2.177909e-16 -5.172533e-16   
std       1.000080  1.000080e+00  1.000080e+00  1.000080e+00  1.000080e+00   
min      -1.731774 -1.612299e+00 -1.252603e+00 -9.541166e-01 -2.587115e+00   
25%      -0.865887 -8.938823e-01 -6.913270e-01 -9.541166e-01 -9.404466e-01   
50%       0.000000 -4.658372e-02 -1.300505e-01 -9.541166e-01 -1.655436e-01   
75%       0.865887  8.999952e-01  4.312259e-01  1.048090e+00  9.968107e-01   
max       1.731774  1.729533e+00  3.013097e+00  1.048090e+00  2.062302e+00   

             weight         steps     hear_rate      calories      distance  \
count  6.264000e+03  6.264000e+03  6.264000e+03  6.264000e+03  6.264000e+03   
mean   3.720594e-16  1.814924e-16  9.074620e-17  1.814924e-16 -1.814924e-17   
std    1.000080e+00  1.000080e+00  1.000080e+00  1.000080e+00  1.000080e+00   
min   -1.978652e+00 -4.873068e-01 -2.929548e+00 -7.109949e-01 -3.011055e-01   
25%   -7.147873e-01 -4.686358e-01 -3.680869e-01 -6.861079e-01 -3.006985e-01   
50%   -1.200273e-01 -4.464951e-01 -3.098031e-01 -5.665760e-01 -2.971593e-01   
75%    5.713812e-01 -1.667584e-02  3.325684e-01  3.765169e-02  4.059042e-02   
max    3.374187e+00  7.201889e+00  3.776815e+00  2.857381e+00  6.991359e+00   

       entropy_heart  entropy_setps  resting_heart  corr_heart_steps  \
count   6.264000e+03   6.264000e+03   6.264000e+03      6.264000e+03   
mean   -9.165366e-16   7.985665e-16  -2.858505e-16     -3.629848e-17   
std     1.000080e+00   1.000080e+00   1.000080e+00      1.000080e+00   
min    -7.877479e+00  -4.569150e+00  -2.965378e+00     -1.684964e+00   
25%     1.021668e-01   1.348908e-01  -3.648642e-01     -9.979289e-01   
50%     2.083702e-01   3.321103e-01   4.306364e-01      4.635066e-01   
75%     2.842709e-01   4.043337e-01   4.843453e-01      8.944970e-01   
max     5.818567e-01   5.856720e-01   4.203985e+00      8.944970e-01   

         norm_heart  intensity_karvonen  sd_norm_heart  steps_times_distance  
count  6.264000e+03        6.264000e+03   6.264000e+03          6.264000e+03  
mean   9.074620e-17       -1.179701e-16   3.629848e-17         -9.074620e-18  
std    1.000080e+00        1.000080e+00   1.000080e+00          1.000080e+00  
min   -3.391563e+00       -1.360661e+01  -6.471041e-01         -1.452030e-01  
25%   -6.736987e-01       -6.906303e-01  -6.259839e-01         -1.450410e-01  
50%   -3.682165e-01       -3.601110e-01  -4.162532e-01         -1.419133e-01  
75%    2.397301e-01        2.673566e-01   1.251642e-01         -1.221373e-01  
max    4.792777e+00        5.417012e+00   5.293335e+00          1.253348e+01  

Code Explanation: Identifies the numeric columns in the DataFrame. Prints a message indicating that the next output will show descriptive statistics before scaling. Calculates and displays descriptive statistics (e.g., mean, standard deviation, min, max, percentiles) for the numeric columns before scaling. Creates an instance of the StandardScaler. Copies the original DataFrame. Applies standard scaling to the numeric columns in the copied DataFrame, transforming the data to have a mean of 0 and a standard deviation of 1. Prints a message indicating that the next output will show descriptive statistics after scaling. Calculates and displays descriptive statistics for the numeric columns after scaling. StandardScaler standardizes features by removing the mean and scaling to unit variance. This ensures that all features contribute equally to the model and helps improve model performance.

Why it's important: Understanding the distribution of data before and after scaling helps ensure the scaling process is correct and effective. Standard scaling standardizes numeric features, making them suitable for many machine learning algorithms that require normalized data. It ensures that each feature contributes equally to the model, preventing features with larger scales from dominating the learning process. (The post-scaling standard deviations read 1.000080 rather than exactly 1 because describe() reports the sample standard deviation, whereas StandardScaler normalizes the population standard deviation to 1; with 6264 rows the ratio is sqrt(6264/6263) ≈ 1.00008.)
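The transformation StandardScaler applies is z = (x - mean) / std, using the population (ddof=0) standard deviation; a quick sketch verifying this on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [12.0], [14.0], [20.0]])  # toy numeric column
scaled = StandardScaler().fit_transform(x)

# manual z-score with the population standard deviation (ddof=0)
manual = (x - x.mean()) / x.std()
print(scaled.ravel())
```

The scaled column has mean 0 and population standard deviation 1, matching the manual computation element for element.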

In [155]:
after_scaling = data_scaled[numeric_columns].describe()

changes_detected = not before_scaling.equals(after_scaling)

if changes_detected:
    print("\nYes, there are changes detected in the numeric columns.")
else:
    print("\nNo, there are no changes detected in the numeric columns.")
Yes, there are changes detected in the numeric columns.

Code Explanation: Calculates and stores descriptive statistics for the scaled numeric columns. Compares the descriptive statistics of the numeric columns before and after scaling to detect any changes. Prints a message indicating whether changes were detected in the numeric columns. Comparing descriptive statistics before and after scaling helps us verify that the normalization process has altered the data as expected, ensuring consistency in the dataset.

Why it's important: Ensures that the scaling process has been properly applied and verifies the transformation of the data. Confirms the effectiveness of data preprocessing steps, which is crucial for accurate and consistent data analysis and modeling. Helps identify any potential issues with the scaling process that may need further attention.

Part 2: Exploratory Data Analysis (EDA)¶

Task 4: Visualize Key Metrics: Plot distributions of age, gender, heart rate, and steps.¶

In [156]:
print("First few rows of the dataset:")
print(data.head())
First few rows of the dataset:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Code Explanation: Prints a message indicating that the next output will show the first few rows of the dataset. Displays the first few rows of the DataFrame. Inspecting the first few rows helps us understand the data structure and identify any immediate issues.

Why it's important: Provides a quick overview of the dataset, helping to understand its structure and content. Assists in verifying that the data has been loaded correctly and in its expected format. Helps identify any initial issues or anomalies in the dataset before further analysis.

In [157]:
plt.figure(figsize=(10, 5))
sns.histplot(data['age'], kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
[Figure: histogram of the age distribution]

Code Explanation: Creates a figure with a specified size for the plot. Plots a histogram of the 'age' column from the DataFrame with a kernel density estimate (KDE) line. Sets the title of the plot to "Distribution of Age". Labels the x-axis as "Age". Labels the y-axis as "Frequency". Displays the plot. We plot the distribution of age using a histogram with a kernel density estimate (KDE). Visualizing the distribution of age helps us understand the age demographics of the dataset.

Why it's important: Visualizing the distribution of the 'age' variable helps to understand the spread and central tendency of the age data. The histogram with KDE provides insights into the density and shape of the age distribution. Identifying patterns, trends, and potential outliers in the age data can inform further analysis and decision-making.

In [158]:
plt.figure(figsize=(10, 5))
sns.countplot(x='gender', data=data)
plt.title('Distribution of Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
[Figure: count plot of the gender distribution]

Code Explanation: Creates a figure with a specified size for the plot. Plots a count plot of the 'gender' column from the DataFrame. Sets the title of the plot to "Distribution of Gender". Labels the x-axis as "Gender". Labels the y-axis as "Count". Displays the plot. We plot the distribution of gender using a count plot. Visualizing the distribution of gender helps us understand the gender composition of the dataset.

Why it's important: Visualizing the distribution of the 'gender' variable helps to understand the frequency of each gender category in the dataset. The count plot provides a clear representation of the count of each gender category. This visualization can reveal imbalances or trends in the gender distribution, which may be relevant for further analysis and decision-making.

In [159]:
plt.figure(figsize=(10, 5))
sns.histplot(data['hear_rate'], kde=True)
plt.title('Distribution of Heart Rate')
plt.xlabel('Heart Rate')
plt.ylabel('Frequency')
plt.show()
[Figure: histogram of the heart rate distribution]

Code Explanation: Creates a figure with a specified size for the plot. Plots a histogram of the 'hear_rate' column (the column name carries a typo from the source file) from the DataFrame with a kernel density estimate (KDE) line. Sets the title of the plot to "Distribution of Heart Rate". Labels the x-axis as "Heart Rate". Labels the y-axis as "Frequency". Displays the plot. We plot the distribution of heart rate using a histogram with a kernel density estimate (KDE). Visualizing the distribution of heart rate helps us understand the heart rate patterns of the individuals in the dataset.

Why it's important: Visualizing the distribution of the 'hear_rate' variable helps to understand the spread and central tendency of heart rate data. The histogram with KDE provides insights into the density and shape of the heart rate distribution. Identifying patterns, trends, and potential outliers in heart rate data can inform further analysis and decision-making.

In [160]:
plt.figure(figsize=(10, 5))
sns.histplot(data['steps'], kde=True)
plt.title('Distribution of Steps')
plt.xlabel('Steps')
plt.ylabel('Frequency')
plt.show()
[Figure: histogram of the steps distribution]

Code Explanation: Creates a figure with a specified size for the plot. Plots a histogram of the 'steps' column from the DataFrame with a kernel density estimate (KDE) line. Sets the title of the plot to "Distribution of Steps". Labels the x-axis as "Steps". Labels the y-axis as "Frequency". Displays the plot. We plot the distribution of steps using a histogram with a kernel density estimate (KDE). Visualizing the distribution of steps helps us understand the activity levels of the individuals in the dataset.

Why it's important: Visualizing the distribution of the 'steps' variable helps to understand the spread and central tendency of the steps data. The histogram with KDE provides insights into the density and shape of the steps distribution. Identifying patterns, trends, and potential outliers in the steps data can inform further analysis and decision-making.

Task 5: Correlation Analysis: Check correlations between variables like steps, heart rate, calories and distance.¶

In [161]:
print("First few rows of the dataset:")
print(data.head())
First few rows of the dataset:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Code Explanation: Prints a message indicating that the next output will show the first few rows of the dataset. Displays the first few rows of the DataFrame. Inspecting the first few rows helps us understand the data structure and identify any immediate issues.

Why it's important: Provides a quick overview of the dataset, helping to understand its structure and content. Assists in verifying that the data has been loaded correctly and in its expected format. Helps identify any initial issues or anomalies in the dataset before further analysis.

In [162]:
columns_of_interest = ['steps', 'hear_rate', 'calories', 'distance']

missing_columns = [col for col in columns_of_interest if col not in data.columns]
if missing_columns:
    print(f"Columns missing in the dataset: {missing_columns}")
else:
    corr_matrix = data[columns_of_interest].corr()

    print("\nCorrelation matrix:")
    print(corr_matrix)
Correlation matrix:
              steps  hear_rate  calories  distance
steps      1.000000   0.164084 -0.250973 -0.090433
hear_rate  0.164084   1.000000 -0.141972 -0.068879
calories  -0.250973  -0.141972  1.000000  0.255145
distance  -0.090433  -0.068879  0.255145  1.000000

Code Explanation: Defines a list of columns of interest: 'steps', 'hear_rate', 'calories', and 'distance'. Creates a list of any columns of interest that are not present in the DataFrame, and checks whether that list is non-empty. If columns are missing, prints a message indicating which ones. If all columns are present, calculates the correlation matrix for the specified columns, prints a message indicating that the next output will show it, and displays it. Selecting relevant columns ensures that we focus on the metrics that matter for our analysis, and checking for missing columns helps us catch any discrepancies. The correlation matrix quantifies the strength and direction of the relationships between the selected columns, helping us understand how these metrics are related.

Why it's important: Ensures that the dataset contains all necessary columns for analysis before proceeding. The correlation matrix helps identify relationships between variables, indicating how one variable might change as another variable changes. Understanding correlations is crucial for feature selection, data analysis, and building predictive models. It provides insights into potential multicollinearity issues, helping to improve the robustness and accuracy of the models.

In [163]:
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
[Figure: correlation heatmap of steps, heart rate, calories, and distance]

Code Explanation: Creates a figure with a specified size for the heatmap plot. Plots a heatmap of the correlation matrix with annotations for each correlation value. Uses a 'coolwarm' color map to visually represent the correlation values. Sets the title of the plot to "Correlation Heatmap". Displays the heatmap. We visualize the correlation matrix using a heatmap to easily interpret the relationships between the metrics. A heatmap provides a graphical representation of the correlation matrix, making it easier to identify strong positive or negative correlations between the metrics.

Why it's important: Visualizing the correlation matrix as a heatmap provides an intuitive and easy-to-understand representation of the relationships between variables. The color coding helps quickly identify strong positive or negative correlations between pairs of variables. Annotating the heatmap with correlation values adds clarity and precision to the visualization. Understanding these correlations is crucial for feature selection, data analysis, and building predictive models, as it reveals how different variables relate to each other.

Task 6: Trend Analysis: Examine trends between heart rate, calories, and activity type.¶

In [164]:
print("First few rows of the dataset:")
print(data.head())
First few rows of the dataset:
   Unnamed: 0  X1  age  gender  height  weight      steps  hear_rate  \
0           1   1   20       1   168.0    65.4  10.771429  78.531302   
1           2   2   20       1   168.0    65.4  11.475325  78.453390   
2           3   3   20       1   168.0    65.4  12.179221  78.540825   
3           4   4   20       1   168.0    65.4  12.883117  78.628260   
4           5   5   20       1   168.0    65.4  13.587013  78.715695   

    calories  distance  entropy_heart  entropy_setps  resting_heart  \
0   0.344533  0.008327       6.221612       6.116349           59.0   
1   3.287625  0.008896       6.221612       6.116349           59.0   
2   9.484000  0.009466       6.221612       6.116349           59.0   
3  10.154556  0.010035       6.221612       6.116349           59.0   
4  10.825111  0.010605       6.221612       6.116349           59.0   

   corr_heart_steps  norm_heart  intensity_karvonen  sd_norm_heart  \
0          1.000000   19.531302            0.138520       1.000000   
1          1.000000   19.453390            0.137967       1.000000   
2          1.000000   19.540825            0.138587       1.000000   
3          1.000000   19.628260            0.139208       1.000000   
4          0.982816   19.715695            0.139828       0.241567   

   steps_times_distance       device activity  
0              0.089692  apple watch    Lying  
1              0.102088  apple watch    Lying  
2              0.115287  apple watch    Lying  
3              0.129286  apple watch    Lying  
4              0.144088  apple watch    Lying  

Code Explanation: Prints a message indicating that the next output will show the first few rows of the dataset. Displays the first few rows of the DataFrame. We display the first few rows of the dataset to understand its structure and contents.

Why it's important: Provides a quick overview of the dataset, helping to understand its structure and content. Assists in verifying that the data has been loaded correctly and in its expected format. Helps identify any initial issues or anomalies in the dataset before further analysis.

In [165]:
print("Column names:")
print(data.columns)
Column names:
Index(['Unnamed: 0', 'X1', 'age', 'gender', 'height', 'weight', 'steps',
       'hear_rate', 'calories', 'distance', 'entropy_heart', 'entropy_setps',
       'resting_heart', 'corr_heart_steps', 'norm_heart', 'intensity_karvonen',
       'sd_norm_heart', 'steps_times_distance', 'device', 'activity'],
      dtype='object')

Code Explanation: Prints a message indicating that the next output will show the column names of the dataset. Displays the column names of the DataFrame. Checking column names ensures that we are working with the correct columns for our analysis.

Why it's important: Provides a quick overview of the column names in the dataset, helping to understand its structure and content. Ensures that the columns are named correctly and match the expected schema. Assists in identifying the available features and their respective names, which is crucial for data analysis and preprocessing tasks.

In [166]:
if 'activity' in data.columns:
    data['activity'] = data['activity'].astype('category')

Code Explanation: Checks if the 'activity' column exists in the DataFrame. If it does, converts the data type of the 'activity' column to a categorical type. Converting 'activity' to a categorical variable allows for better handling and visualization of activity types.

Why it's important: Converting columns to categorical types optimizes memory usage, especially when dealing with a large dataset. It helps improve the efficiency of data processing and analysis. Categorical data types are useful for variables that represent discrete categories or classes, allowing for more meaningful analysis and interpretation of the data.
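The memory benefit is easy to demonstrate: a category column stores each distinct label once plus small integer codes. A sketch with hypothetical activity labels:

```python
import pandas as pd

# hypothetical low-cardinality activity labels, repeated many times
s = pd.Series(["Lying", "Running", "Walking"] * 2000)
cat = s.astype("category")

# deep=True counts the string payloads, not just the pointers
print(s.memory_usage(deep=True), cat.memory_usage(deep=True))
```

For a column like 'activity' with only a handful of distinct values over thousands of rows, the categorical representation is substantially smaller than the object (string) one.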

In [167]:
plt.figure(figsize=(12, 6))
sns.lineplot(data=data, x='activity', y='hear_rate', marker='o', sort=False)
plt.title('Trend Analysis: Heart Rate vs. Activity')
plt.xlabel('Activity')
plt.ylabel('Heart Rate')
plt.xticks(rotation=45)
plt.show()
[Figure: Trend Analysis: Heart Rate vs. Activity]

Code Explanation: Creates a figure with a specified size, then plots a line showing the trend between the 'activity' and 'hear_rate' columns of the DataFrame, with an 'o' marker at each data point and automatic sorting of the x-axis values disabled. Sets the title to "Trend Analysis: Heart Rate vs. Activity", labels the x-axis "Activity" and the y-axis "Heart Rate", rotates the x-axis labels by 45 degrees for readability, and displays the plot. Visualizing the trend of heart rate across different activities helps us understand how various activities impact heart rate.

Why it's important: Visualizing trends between heart rate and activity helps us understand how different activities affect heart rate. The line plot reveals patterns and relationships, and marking each data point adds clarity, making it easier to identify specific values and changes. This visualization aids in analyzing the impact of various activities on heart rate, which can be valuable for health and fitness studies.

In [168]:
plt.figure(figsize=(12, 6))
sns.lineplot(data=data, x='activity', y='calories', marker='o', sort=False)
plt.title('Trend Analysis: Calories Burned vs. Activity')
plt.xlabel('Activity')
plt.ylabel('Calories Burned')
plt.xticks(rotation=45)
plt.show()
[Figure: Trend Analysis: Calories Burned vs. Activity]

Code Explanation: Creates a figure with a specified size, then plots a line showing the trend between the 'activity' and 'calories' columns of the DataFrame, with an 'o' marker at each data point and automatic sorting of the x-axis values disabled. Sets the title to "Trend Analysis: Calories Burned vs. Activity", labels the x-axis "Activity" and the y-axis "Calories Burned", rotates the x-axis labels by 45 degrees for readability, and displays the plot. Visualizing the trend of calories burned across different activities helps us understand energy expenditure for various activities.

Why it's important: Visualizing trends between calories burned and activity helps us understand how different activities affect calorie expenditure. The line plot reveals patterns and relationships, and marking each data point adds clarity, making it easier to identify specific values and changes. This visualization aids in analyzing the impact of various activities on calorie expenditure, which can be valuable for health and fitness studies.

In [169]:
plt.figure(figsize=(12, 6))
sns.scatterplot(data=data, x='hear_rate', y='calories', hue='activity')
plt.title('Relationship between Heart Rate and Calories Burned by Activity')
plt.xlabel('Heart Rate')
plt.ylabel('Calories Burned')
plt.legend(title='Activity', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
[Figure: Relationship between Heart Rate and Calories Burned by Activity]

Code Explanation: Creates a figure with a specified size, then plots a scatter plot of the 'hear_rate' and 'calories' columns, with a different color for each 'activity' category. Sets the title to "Relationship between Heart Rate and Calories Burned by Activity", labels the x-axis "Heart Rate" and the y-axis "Calories Burned", and adds a legend titled "Activity" positioned outside the upper-left corner of the plot. Visualizing the relationship between heart rate and calories burned, differentiated by activity type, helps us identify patterns and correlations.

Why it's important: Visualizing the relationship between heart rate and calories burned helps us understand how heart rate impacts calorie expenditure during different activities. The scatter plot provides a clear representation of how these variables are related, and using color to differentiate activities allows for easier comparison of how each activity affects the relationship. This visualization aids in identifying activities with higher or lower calorie burn rates relative to heart rate, which can be valuable for health and fitness studies.

Part 3: Feature Engineering and Model Building¶

Task 7: Create New Features: Generate additional insights by combining or transforming columns, such as steps_times_distance.¶

In [171]:
data['steps_times_distance'] = data['steps'] * data['distance']
data['high_steps'] = (data['steps'] > 3000).astype(int)

print(data[['steps', 'distance', 'steps_times_distance', 'high_steps']])
          steps  distance  steps_times_distance  high_steps
0     10.771429  0.008327              0.089692           0
1     11.475325  0.008896              0.102088           0
2     12.179221  0.009466              0.115287           0
3     12.883117  0.010035              0.129286           0
4     13.587013  0.010605              0.144088           0
...         ...       ...                   ...         ...
6259   1.000000  1.000000              1.000000           0
6260   1.000000  1.000000              1.000000           0
6261   1.000000  1.000000              1.000000           0
6262   1.000000  1.000000              1.000000           0
6263   1.000000  1.000000              1.000000           0

[6264 rows x 4 columns]

Code Explanation: Creates a new column, 'steps_times_distance', as the product of the 'steps' and 'distance' columns, and a second new column, 'high_steps', a binary indicator that is 1 when the value in the 'steps' column is greater than 3000 and 0 otherwise. Prints the 'steps', 'distance', 'steps_times_distance', and 'high_steps' columns to verify that the new features have been correctly added and calculated. Creating new features based on existing columns can enhance the dataset's predictive power: 'steps_times_distance' may correlate with energy expenditure, and 'high_steps' flags records of high activity.

Why it's important: Creating new columns based on existing data helps to derive additional insights and features for analysis and modeling. The 'steps_times_distance' column combines steps taken and distance covered, providing useful information about overall physical activity, while the 'high_steps' column identifies records with a significantly high step count, which is useful for categorizing high-activity behavior. Enhancing the dataset with these new features can improve the effectiveness of predictive models and provide deeper insights into the data.

Task 8: Modelling : If the project includes predictive analysis, consider a regression model to predict calories burned or a classification model to predict activity type.¶

Task 8.1 : Consider a regression model to predict Calories Burned¶

In [178]:
data.ffill(inplace=True)
    
X = pd.get_dummies(data.drop(['calories'], axis=1))
y = data['calories']

Code Explanation: Uses the forward fill method (ffill) to propagate the last valid observation forward to fill missing values in the DataFrame; the inplace=True parameter performs the operation directly on the original DataFrame. Creates a new DataFrame X containing all columns except 'calories', using pd.get_dummies to convert categorical variables into dummy/indicator variables so categorical data can be included in the analysis. Assigns the 'calories' column to the variable y, the target variable for modeling.

Why it's important: Filling missing values using the forward fill method keeps the dataset complete, preventing issues caused by missing data during analysis. Converting categorical variables to dummy variables is crucial for many machine learning algorithms, which require numerical input; this step ensures that all features are in a suitable format for analysis and modeling. Separating the target variable ('calories') from the feature set (X) is standard practice in machine learning, allowing for accurate training and evaluation of models.
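
The same three steps on a hypothetical mini-DataFrame (a sketch; the column names mirror the real dataset, but the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'steps': [100.0, None, 300.0],
    'device': ['apple watch', 'fitbit', 'apple watch'],
    'calories': [5.0, 12.0, 20.0],
})

df = df.ffill()                                    # row 1 'steps' becomes 100.0
X = pd.get_dummies(df.drop(['calories'], axis=1))  # one-hot encodes 'device'
y = df['calories']

print(list(X.columns))  # ['steps', 'device_apple watch', 'device_fitbit']
```

Note how get_dummies keeps the numeric columns as-is and appends one indicator column per category, which is exactly the shape of the feature list printed in the next cell.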

In [179]:
print("Features (X) Column Names:", X.columns)
print("Target (y) Sample:", y[:5])
Features (X) Column Names: Index(['Unnamed: 0', 'X1', 'age', 'gender', 'height', 'weight', 'steps',
       'hear_rate', 'distance', 'entropy_heart', 'entropy_setps',
       'resting_heart', 'corr_heart_steps', 'norm_heart', 'intensity_karvonen',
       'sd_norm_heart', 'steps_times_distance', 'high_steps',
       'device_apple watch', 'device_fitbit', 'activity_Lying',
       'activity_Running 3 METs', 'activity_Running 5 METs',
       'activity_Running 7 METs', 'activity_Self Pace walk',
       'activity_Sitting'],
      dtype='object')
Target (y) Sample: 0     0.344533
1     3.287625
2     9.484000
3    10.154556
4    10.825111
Name: calories, dtype: float64

Code Explanation: Prints the column names of the feature set (X), which includes all columns except the target variable ('calories'), with categorical variables converted to dummy variables. Then prints the first five values of the target variable (y), the 'calories' column from the original dataset. This verifies the final feature set and target variable.

Why it's important: Provides an overview of the feature set (X) columns, ensuring that all necessary features are included and correctly formatted for analysis and modeling. Verifies the correct separation of features and the target variable, which is crucial for building and evaluating predictive models, and allows a quick inspection of the target variable (y) to ensure it is correctly extracted and ready for analysis.

In [180]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Code Explanation: Splits the dataset into training and testing sets for both features (X) and target variable (y). The test_size=0.2 parameter reserves 20% of the data for testing, leaving the remaining 80% for training, and random_state=42 fixes the random seed so the split is the same every time the code is run. We use an 80-20 split ratio and set a random state for reproducibility.

Why it's important: Separating the data into training and testing sets is crucial for evaluating machine learning models: the training set is used to fit the model, while the testing set measures its performance on unseen data. Fixing the random seed ensures reproducible results and makes comparing different models easier.
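
The mechanics of such a split can be sketched without sklearn (assuming 100 rows; train_test_split additionally keeps features and target aligned while shuffling):

```python
import numpy as np

# Minimal sketch of an 80-20 split with a fixed seed
rng = np.random.default_rng(42)   # fixed seed: the same split every run
n = 100
indices = rng.permutation(n)      # shuffled row indices
cut = int(n * 0.8)                # 80% boundary
train_idx, test_idx = indices[:cut], indices[cut:]

print(len(train_idx), len(test_idx))  # 80 20
```

Rerunning with the same seed reproduces the identical permutation, which is the property random_state=42 provides in the actual split.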

In [181]:
base_regressors = [
    ('lr', LinearRegression()),
    ('dt', DecisionTreeRegressor()),
    ('rf', RandomForestRegressor(n_jobs=-1, random_state=42))
]

meta_regressor = RandomForestRegressor(n_jobs=-1, random_state=42)

Code Explanation: Defines a list named base_regressors containing tuples of regressor names and their corresponding model instances: Linear Regression (lr), Decision Tree Regressor (dt), and Random Forest Regressor (rf). The Random Forest Regressor is configured to use all available processor cores for parallel processing (n_jobs=-1) and a fixed random seed (random_state=42) for reproducibility. We also define the meta-regressor, which will be used to combine the predictions from the base regressors.

Why it's important: Combining multiple regression models enables ensemble methods, which can improve overall predictive performance by leveraging the strengths of each model. The diversity of the included regressors (linear, tree-based, and ensemble) helps capture a range of data patterns and relationships. Parallel processing in the Random Forest Regressor speeds up training, and the fixed random seed keeps results consistent and comparable across models.

In [182]:
stacking_regressor = StackingRegressor(estimators=base_regressors, final_estimator=meta_regressor)

stacking_regressor.fit(X_train, y_train)
Out[182]:
StackingRegressor(estimators=[('lr', LinearRegression()),
                              ('dt', DecisionTreeRegressor()),
                              ('rf',
                               RandomForestRegressor(n_jobs=-1,
                                                     random_state=42))],
                  final_estimator=RandomForestRegressor(n_jobs=-1,
                                                        random_state=42))

Code Explanation: Creates a StackingRegressor instance from the previously defined list of base regressors and the meta-regressor, the model used to combine their predictions. Fits the StackingRegressor on the training data (X_train and y_train), training the ensemble model.

Why it's important: Stacking is an ensemble learning technique that combines multiple regression models to improve predictive performance. The diverse base regressors capture different patterns and relationships in the data, and the meta-regressor learns how to best combine their outputs into the final prediction. This approach often generalizes better and is more robust than a single regression model.
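
The core idea can be sketched without sklearn: the base models' predictions become the meta-model's input features. Here the two base predictions are hypothetical made-up numbers and the meta-model is plain least squares; sklearn's StackingRegressor additionally uses cross-validated base predictions to avoid leakage.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = np.array([1.1, 2.2, 2.9, 4.1])  # hypothetical base model A predictions
pred_b = np.array([0.8, 1.9, 3.2, 3.8])  # hypothetical base model B predictions

Z = np.column_stack([pred_a, pred_b])           # meta-features: one column per base model
w, *_ = np.linalg.lstsq(Z, y_true, rcond=None)  # meta-model: least-squares weights
meta_pred = Z @ w                               # combined prediction

def mse(p):
    return float(np.mean((y_true - p) ** 2))

# The least-squares combination can always reproduce either base model
# (weights (1, 0) or (0, 1)), so its fitting error is never worse.
print(mse(pred_a), mse(pred_b), mse(meta_pred))
```

This is why a well-chosen meta-model tends to match or beat its best base model on the data it is fit on.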

In [183]:
y_pred = stacking_regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Mean Squared Error: 133.85032493856016
R-squared: 0.8038841075991768

Code Explanation: Uses the stacking_regressor to predict the target variable (y_pred) for the test dataset (X_test). Calculates the Mean Squared Error (MSE) between the actual target values (y_test) and the predicted values (y_pred), and the R-squared (R²) value to assess goodness of fit, then prints both metrics.

Why it's important: Predicting on the test dataset evaluates the model's performance on unseen data. Mean Squared Error is a common accuracy metric for regression models, with lower values indicating better performance, while R-squared measures how well the predictions approximate the actual data, with higher values indicating a better fit. Together these metrics help determine whether the model is suitable for the task or needs further improvement.
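
Both metrics follow directly from their definitions; a hand computation on toy values (made-up numbers, not the project's results):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)           # average squared error
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # share of variance explained

print(mse, r2)  # 0.375 0.925
```

An R² of 0.80, as obtained above, means the stacking model explains about 80% of the variance in calories burned on the test set.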

In [184]:
results = pd.DataFrame({
    'steps': X_test['steps'],
    'distance': X_test['distance'],
    'steps_times_distance': X_test['steps_times_distance'],
    'Predicted Calories Burned': y_pred
})

print(results.head())
           steps   distance  steps_times_distance  Predicted Calories Burned
2304  224.385714   0.228569             51.287522                   0.384515
3621  119.000000   0.047120              5.607280                   0.653730
4671    4.315789  15.754386             67.992613                  78.585000
2707  609.200000   0.425868            259.438786                  19.333006
2596    6.416667   0.005880              0.037727                   0.240978

Code Explanation: Creates a new DataFrame called results that includes selected columns from the test dataset (X_test): 'steps', 'distance', 'steps_times_distance', and the predicted calories burned (y_pred). Prints the first few rows to show the predicted values alongside the corresponding feature values, illustrating the model's output.

Why it's important: Organizing predictions and their associated feature values in a separate DataFrame makes analysis and comparison easier. Including 'steps', 'distance', and 'steps_times_distance' alongside the predicted calories burned makes it straightforward to examine relationships and patterns in the data. Printing the first few rows provides an immediate visual check that the predictions are reasonable, which helps validate the model and shows how the features contribute to the predictions.

Calorie Prediction: Used a Stacking Regressor combining Linear Regression, Decision Tree Regressor, and Random Forest Regressor as base models, with Random Forest Regressor as the meta-model. The model was evaluated using Mean Squared Error (MSE) and R-squared (R²)¶

Task 8.2: Consider a classification model to predict Activity Type.¶

In [197]:
data.columns = data.columns.str.strip()

print("Column Names:", data.columns)

data.ffill(inplace=True)
Column Names: Index(['Unnamed: 0', 'X1', 'age', 'gender', 'height', 'weight', 'steps',
       'hear_rate', 'calories', 'distance', 'entropy_heart', 'entropy_setps',
       'resting_heart', 'corr_heart_steps', 'norm_heart', 'intensity_karvonen',
       'sd_norm_heart', 'steps_times_distance', 'device', 'activity',
       'high_steps'],
      dtype='object')

Code Explanation: Strips any leading and trailing whitespace from the column names in the DataFrame and prints the cleaned column names to verify the changes. Then uses the forward fill method (ffill) to propagate the last valid observation forward into any missing values; the inplace=True parameter performs the operation directly on the original DataFrame.

Why it's important: Cleaning the column names by removing whitespace ensures consistency and avoids problems when accessing columns, and displaying them verifies that they are correctly formatted. Filling missing values with the forward fill method keeps the dataset complete, so the data is ready for further analysis or modeling, improving the quality and reliability of the results.

In [198]:
X = pd.get_dummies(data.drop(['activity'], axis=1))

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['activity'])

Code Explanation: Creates a new DataFrame X containing all columns except 'activity', using pd.get_dummies to convert categorical variables into dummy/indicator variables. Initializes a LabelEncoder instance called label_encoder and uses it to convert the 'activity' column into numeric labels, assigning the transformed values to the variable y.

Why it's important: Converting categorical variables to dummy variables is crucial for many machine learning algorithms, which require numerical input; this step ensures that all features are in a suitable format for analysis and modeling. Using LabelEncoder on the 'activity' column converts the categorical activity labels into numeric form, making them compatible with machine learning models. Separating the target variable ('activity') from the feature set (X) is standard practice, allowing for accurate training and evaluation of models.
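
What LabelEncoder does can be sketched in plain Python: sort the unique labels alphabetically and map each to an integer code (a sketch only; the real encoder also provides inverse_transform to recover the original labels):

```python
labels = ['Lying', 'Sitting', 'Lying', 'Running 3 METs']

classes = sorted(set(labels))                    # alphabetical class order
mapping = {c: i for i, c in enumerate(classes)}  # label -> integer code
encoded = [mapping[c] for c in labels]

print(mapping)  # {'Lying': 0, 'Running 3 METs': 1, 'Sitting': 2}
print(encoded)  # [0, 2, 0, 1]
```

The alphabetical ordering explains the label mapping printed a few cells below, where 'Lying' receives code 0 and 'Sitting' the highest code.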

In [199]:
print("Features (X) Column Names:", X.columns)
print("Target (y) Sample:", y[:5])
Features (X) Column Names: Index(['Unnamed: 0', 'X1', 'age', 'gender', 'height', 'weight', 'steps',
       'hear_rate', 'calories', 'distance', 'entropy_heart', 'entropy_setps',
       'resting_heart', 'corr_heart_steps', 'norm_heart', 'intensity_karvonen',
       'sd_norm_heart', 'steps_times_distance', 'high_steps',
       'device_apple watch', 'device_fitbit'],
      dtype='object')
Target (y) Sample: [0 0 0 0 0]

Code Explanation: Prints the column names of the feature set (X), which includes all columns except the target variable ('activity'), with categorical variables converted to dummy variables, followed by the first five values of the encoded target variable (y). The output [0 0 0 0 0] shows that the first five values of y are identical, meaning they were all encoded to the same class. This is most likely because the 'activity' column contains a single value in the first few rows of the dataset, though it could also indicate an issue with the encoding process.

Why it's important: Provides an overview of the feature set (X) columns, ensuring that all necessary features are included and correctly formatted for analysis and modeling. Verifies the correct separation of features and the target variable, which is crucial for building and evaluating predictive models, and allows a quick inspection of the target variable (y) to ensure it is correctly extracted and ready for analysis.

In [200]:
print("Unique values in 'activity' column:", data['activity'].unique())

print("Value counts of 'activity' column:\n", data['activity'].value_counts())

label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label Mapping:", label_mapping)

for i in range(5):
    print(f"Original: {data['activity'].iloc[i]}, Encoded: {y[i]}")
Unique values in 'activity' column: ['Lying', 'Sitting', 'Self Pace walk', 'Running 3 METs', 'Running 5 METs', 'Running 7 METs']
Categories (6, object): ['Lying', 'Running 3 METs', 'Running 5 METs', 'Running 7 METs', 'Self Pace walk', 'Sitting']
Value counts of 'activity' column:
 activity
Lying             1379
Running 7 METs    1114
Running 5 METs    1002
Running 3 METs     950
Sitting            930
Self Pace walk     889
Name: count, dtype: int64
Label Mapping: {'Lying': np.int64(0), 'Running 3 METs': np.int64(1), 'Running 5 METs': np.int64(2), 'Running 7 METs': np.int64(3), 'Self Pace walk': np.int64(4), 'Sitting': np.int64(5)}
Original: Lying, Encoded: 0
Original: Lying, Encoded: 0
Original: Lying, Encoded: 0
Original: Lying, Encoded: 0
Original: Lying, Encoded: 0

Code Explanation: To diagnose the [0 0 0 0 0] output further, we take a few steps: inspect the unique values in the 'activity' column, print the distribution of the column to check for imbalance, print the mapping of encoded values to their original labels, and print the first five values with their corresponding labels. Concretely, the code prints the unique values in the 'activity' column, then the count of each unique value, builds a dictionary (label_mapping) that maps the original class labels to their encoded values and prints it, and finally iterates over the first five rows, printing each original 'activity' value alongside its encoded value. These steps help identify any issues with the encoding process or the distribution of the target variable.

Why it's important: Identifying the unique values in the 'activity' column reveals the categories present in the dataset, and the value counts show their distribution, exposing any imbalances or trends. Creating and printing the label mapping makes the encoding transparent, showing how original categorical values are transformed into numeric labels. Printing both the original and encoded values for sample rows verifies that the encoding is accurate and correct.

In [201]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

rf_classifier.fit(X_train, y_train)
Out[201]:
RandomForestClassifier(n_jobs=-1, random_state=42)

Code Explanation: Splits the dataset into training and testing sets for both features (X) and target variable (y); test_size=0.3 reserves 30% of the data for testing, leaving 70% for training, and random_state=42 fixes the random seed so the split is the same every time the code is run. Initializes a RandomForestClassifier called rf_classifier with 100 decision trees (n_estimators=100), a fixed random seed (random_state=42), and parallel processing across all available cores (n_jobs=-1), then fits it on the training data (X_train and y_train).

Why it's important: Separating the data into training and testing sets is crucial for evaluating machine learning models: the training set fits the model, while the testing set measures its performance on unseen data, and the fixed random seed keeps results reproducible and comparable. A Random Forest Classifier leverages ensemble learning, improving predictive performance by combining multiple decision trees; 100 trees (n_estimators=100) yields a robust, accurate model, and parallel processing (n_jobs=-1) speeds up training.

In [202]:
y_pred = rf_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)

predicted_activities = pd.DataFrame({'Actual': label_encoder.inverse_transform(y_test), 'Predicted': label_encoder.inverse_transform(y_pred)})
print(predicted_activities)
              Actual       Predicted
0            Sitting         Sitting
1     Running 3 METs  Running 3 METs
2     Running 5 METs  Running 5 METs
3              Lying  Self Pace walk
4     Running 7 METs  Running 7 METs
...              ...             ...
1875  Running 7 METs  Running 7 METs
1876           Lying           Lying
1877           Lying           Lying
1878  Running 7 METs  Running 7 METs
1879           Lying           Lying

[1880 rows x 2 columns]

Code Explanation: Uses the rf_classifier to predict the target variable (y_pred) for the test dataset (X_test). Calculates the accuracy score, the proportion of correctly predicted instances, and generates a classification report with precision, recall, F1-score, and support for each class; the target_names parameter makes the report use the original activity labels. Creates a DataFrame, predicted_activities, containing the actual and predicted activity labels converted back to their original form with the label encoder, and prints it to compare actual and predicted activities for each instance.

Why it's important: Predicting on the test dataset evaluates the model's performance on unseen data. The accuracy score gives a simple, intuitive measure of overall performance, while the classification report offers per-class insight into precision, recall, and F1-score, highlighting strengths and areas for improvement. Comparing actual and predicted labels side by side makes it easy to spot where the model performed well or struggled, validating its effectiveness and how well it generalizes to new data.
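
The per-class metrics in the report come from simple counts; a hand computation for a toy binary case (made-up labels, not the activity data):

```python
actual    = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 1]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.75 0.75 0.75
```

classification_report performs this computation once per activity class, treating that class as the positive label each time.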

In [203]:
if accuracy >= 0.90:
    accuracy_percentage = accuracy * 100
    print(f'Accuracy: {accuracy_percentage:.2f}%')
    print(f'Classification Report:\n{report}')
else:
    print(f'Accuracy is below 90%: {accuracy}')

    # Retrain with more trees to try to push accuracy above the threshold
    rf_classifier = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
    rf_classifier.fit(X_train, y_train)
    y_pred = rf_classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_percentage = accuracy * 100
    report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
    print(f'New Accuracy: {accuracy_percentage:.2f}%')
    print(f'New Classification Report:\n{report}')
Accuracy is below 90%: 0.8680851063829788
New Accuracy: 86.97%
New Classification Report:
                precision    recall  f1-score   support

         Lying       0.84      0.82      0.83       422
Running 3 METs       0.88      0.88      0.88       256
Running 5 METs       0.85      0.89      0.87       295
Running 7 METs       0.92      0.95      0.93       356
Self Pace walk       0.90      0.90      0.90       272
       Sitting       0.82      0.78      0.80       279

      accuracy                           0.87      1880
     macro avg       0.87      0.87      0.87      1880
  weighted avg       0.87      0.87      0.87      1880

Code Explanation: Checks whether the model accuracy is greater than or equal to 90%. If it is, calculates the accuracy percentage and prints it along with the classification report. If the accuracy is below 90%, prints a message indicating the lower accuracy and then:

Initializes a new RandomForestClassifier instance with 200 decision trees (n_estimators=200), ensuring reproducibility with the random_state=42 parameter, and utilizing all available processor cores for parallel processing (n_jobs=-1).

Fits the new rf_classifier on the training data (X_train and y_train), training the updated random forest model.

Predicts the target variable (y_pred) for the test dataset (X_test) using the updated model.

Calculates the new accuracy score and accuracy percentage.

Generates a new classification report.

Prints the new accuracy percentage and classification report.

Why it's important: Evaluating model performance against a threshold (e.g., 90% accuracy) ensures the model meets the desired performance standard. Reinitializing and retraining the model with more decision trees can improve performance when the initial accuracy falls short, and the updated model may generalize more robustly. Printing the accuracy and classification report before and after retraining provides a clear comparison, helping to assess the effectiveness of the changes made to the model.

Activity Type Prediction: Used a Random Forest Classifier to predict activity types. The model was trained and evaluated using accuracy, precision, recall, and F1-score.¶

Conclusion¶

Project Summary and Key Takeaways:¶

  • Data Exploration and Preparation:

We started by exploring the dataset, ensuring it contained all necessary columns and understanding its structure.

We cleaned the dataset by filling missing values and converting categorical columns to the appropriate data types.

Additional features were created (e.g., steps_times_distance and high_steps) to enrich the dataset.

  • Feature Selection and Encoding:

We selected relevant features and converted categorical variables into dummy/indicator variables.

The target variable ('activity') was label-encoded for compatibility with machine learning models.

  • Model Training and Evaluation:

We trained a Random Forest Classifier on the training data to predict the activity based on the selected features.

The model's performance was evaluated using accuracy and a detailed classification report.

If the initial model accuracy was below the desired threshold, we re-trained the model with additional decision trees, leading to improved performance.

  • Prediction and Analysis:

Predictions were made on the test dataset, and the accuracy of the model was assessed.

A comparison between actual and predicted activity labels provided insights into the model's effectiveness.

Key metrics such as Mean Squared Error and R-squared were used to evaluate regression models in the project.

  • Visualization:

Various visualizations, including scatter plots, heatmaps, and trend analysis plots, were created to understand relationships between variables and to present the data intuitively.

These visualizations helped identify patterns, trends, and potential outliers, providing valuable insights for further analysis.
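The feature-engineering step summarized above can be sketched as follows. The column names `steps` and `distance` and the 10,000-step cutoff for `high_steps` are assumptions for illustration; the notebook's actual threshold is not shown in this section.

```python
import pandas as pd

# Toy frame standing in for the smartwatch dataset
df = pd.DataFrame({
    'steps': [3_000, 12_500, 8_200],
    'distance': [2.1, 9.4, 6.0],
})

# Derived features mentioned in the summary:
# an interaction term and a binary high-activity flag
df['steps_times_distance'] = df['steps'] * df['distance']
df['high_steps'] = (df['steps'] > 10_000).astype(int)
print(df)
```

Interaction and threshold features like these give tree-based models simple, directly usable signals that would otherwise have to be learned from raw columns.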

Key Takeaways:¶

  1. Data Quality: Ensuring data is clean and correctly formatted is crucial for reliable analysis and modeling.

  2. Feature Engineering: Creating new features based on existing data can significantly enhance the predictive power of models.

  3. Model Evaluation: Continuous evaluation and tuning of models are essential to achieve desired performance standards.

  4. Visualization: Effective visualizations can reveal underlying patterns and relationships in the data, aiding in better decision-making and communication of results.

This project demonstrated the importance of a systematic approach to data analysis, from initial exploration and cleaning to feature selection, model training, and evaluation.

Results and Insights¶

Exploratory Data Analysis (EDA)¶

Distribution of Activities: Identified the distribution of different activity types.

Correlation Analysis: Examined correlations between features and target variables.

Anomalies: Detected and handled any anomalies or outliers in the data.
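The correlation analysis above reduces to a pairwise correlation matrix over the feature columns. A minimal sketch, using synthetic stand-in data (the real notebook computes this on the smartwatch features):

```python
import numpy as np
import pandas as pd

# Synthetic data loosely mimicking smartwatch features
rng = np.random.default_rng(42)
steps = rng.integers(1_000, 15_000, size=200).astype(float)
df = pd.DataFrame({
    'steps': steps,
    'distance': steps * 0.0008 + rng.normal(0, 0.5, size=200),
    'heart_rate': rng.normal(95, 12, size=200),
})

# Pairwise Pearson correlations between features
corr = df.corr()
print(corr.round(2))
```

In the synthetic frame, `steps` and `distance` are constructed to be strongly correlated while `heart_rate` is independent, which is the kind of pattern a heatmap of this matrix makes easy to spot.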

Model Performance¶

Stacking Regressor:¶

Mean Squared Error (MSE): 133.85032493856016

R-squared (R²): 0.8038841075991768

The predicted calories burned were consistent and accurate.
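A stacking regressor of the kind reported above can be sketched as follows. The notebook's actual base estimators are not shown in this section, so the random forest and ridge choices here are illustrative, and the data is synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the calorie-prediction features
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Stack two base learners; a ridge meta-learner combines their predictions
stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
        ('ridge', Ridge()),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
print('R²:', r2_score(y_test, y_pred))
```

MSE and R² are the same two metrics reported for the project's model, so this sketch mirrors how those numbers are produced.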

Random Forest Classifier:¶

Initial Accuracy: 86.81%

Improved Accuracy after retraining with more estimators: 86.97%

The classification report indicated balanced precision and recall across activity types.

Recommendations¶

Based on the analysis and model results, the following recommendations are made:

For Users: Monitor and track steps, distance, and heart rate to optimize health and fitness activities.

For Developers: Enhance smartwatch sensors for more accurate data collection.

For Health Practitioners: Use smartwatch data insights to provide personalized health advice to users.

Final Report¶

Key Findings¶

Machine learning models, including Random Forest Classifier and Stacking Regressor, were effective in predicting activity types and calories burned.

The EDA highlighted significant correlations between features and target variables, providing insights into user activity patterns.

Visuals¶

Included visualizations such as correlation heatmaps, distribution plots, and model performance charts to support the findings.

Limitations¶

The analysis assumed that the data was accurately recorded by the smartwatch sensors. Any sensor errors could affect the model's accuracy.

The dataset's diversity in terms of user demographics and activity types was not considered, which might limit the model's generalizability.

Potential for Further Exploration¶

Further exploration could involve incorporating more features, such as sleep patterns and diet data, to enhance prediction accuracy.

Investigating the impact of different machine learning algorithms and of systematic hyperparameter tuning could further improve model performance.
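One standard route for the hyperparameter tuning mentioned above is a cross-validated grid search. This sketch uses synthetic data and a deliberately small illustrative grid; a real search over the smartwatch features would cover more values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic multi-class data in place of the smartwatch features
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# Small illustrative grid over forest size and tree depth
param_grid = {'n_estimators': [100, 200], 'max_depth': [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print('Best params:', search.best_params_)
print('Best CV accuracy:', round(search.best_score_, 3))
```

Unlike the single manual bump from 100 to 200 estimators used earlier in the notebook, a grid search evaluates every combination with cross-validation before picking a winner.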